



Measures of Dispersion



Introduction

In the previous chapter, we studied measures of central tendency, which provide a single value to represent the center of a dataset. However, an average alone is not sufficient to describe a distribution completely. It tells us about the central location but reveals nothing about how the individual observations are scattered or spread around that central value. This scattering of data is called dispersion.

Consider two cricket batsmen, A and B, who have both scored an average of 50 runs in their last five matches. Based on the mean alone, their performance seems identical. Suppose, however, that Batsman A's scores are clustered closely around 50, while Batsman B's scores swing between very low and very high totals.

While their average is the same, Batsman A is highly consistent, whereas Batsman B is very inconsistent, with scores spread far and wide. We would consider Batsman A to be more reliable. This example highlights the need to study not just the average, but also the variability or dispersion of the data. Measures of dispersion quantify the extent to which data points in a distribution differ from the average, giving us a measure of consistency and reliability.



Measures Based Upon Spread Of Values

These are measures of dispersion that are calculated based on the spread or distance between specific values in the dataset, without reference to a central average. The two main measures in this category are Range and Quartile Deviation.


Range

The Range is the simplest possible measure of dispersion. It is defined as the difference between the highest (Largest) value and the lowest (Smallest) value in a dataset.

Formula:

$ \text{Range} = L - S $

where $L$ is the largest value and $S$ is the smallest value.
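
Since the range depends only on the two extreme values, it is trivial to compute; a minimal Python sketch (using the marks from Example 1 below) is:

```python
def value_range(values):
    """Range = largest observation (L) minus smallest observation (S)."""
    return max(values) - min(values)

marks = [45, 50, 35, 62, 75, 48, 55, 80, 42]  # data from Example 1 below
print(value_range(marks))  # 80 - 35 = 45
```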


Range: Comments

The main advantage of the range is its simplicity; it is very easy to calculate and understand. However, it is a crude and often unreliable measure of dispersion because it is based only on the two most extreme values. It completely ignores the distribution of all other observations in the dataset. A single outlier (an unusually high or low value) can drastically affect the range, giving a misleading picture of the overall variability.


Quartile Deviation

To overcome the drawback of the range, we use a measure that is based on the middle 50% of the data, thereby ignoring the extreme values. The Interquartile Range is the difference between the third quartile ($Q_3$) and the first quartile ($Q_1$).

$ \text{Interquartile Range} = Q_3 - Q_1 $

The Quartile Deviation (Q.D.), also known as the semi-interquartile range, is half of the interquartile range.

Formula:

$ \text{Quartile Deviation (Q.D.)} = \frac{Q_3 - Q_1}{2} $

Q.D. is a better measure of dispersion than the range as it is not affected by extreme observations.


Example 1. The marks of 9 students are: 45, 50, 35, 62, 75, 48, 55, 80, 42. Calculate the Range and Quartile Deviation.

Answer:

First, arrange the data in ascending order: 35, 42, 45, 48, 50, 55, 62, 75, 80.

Range:

Largest value (L) = 80; Smallest value (S) = 35.

$ \text{Range} = 80 - 35 = 45 $.

Quartile Deviation:

Number of observations ($n$) = 9.

$ Q_1 = \text{Size of } \left(\frac{n+1}{4}\right)^{th} \text{ item} = \text{Size of } \left(\frac{9+1}{4}\right)^{th} = 2.5^{th} \text{ item} $

$ Q_1 = 2^{nd} \text{ item} + 0.5 \times (3^{rd} \text{ item} - 2^{nd} \text{ item}) = 42 + 0.5 \times (45 - 42) = 42 + 1.5 = 43.5 $.

$ Q_3 = \text{Size of } 3\left(\frac{n+1}{4}\right)^{th} \text{ item} = \text{Size of } 3(2.5)^{th} = 7.5^{th} \text{ item} $

$ Q_3 = 7^{th} \text{ item} + 0.5 \times (8^{th} \text{ item} - 7^{th} \text{ item}) = 62 + 0.5 \times (75 - 62) = 62 + 6.5 = 68.5 $.

$ \text{Quartile Deviation (Q.D.)} = \frac{Q_3 - Q_1}{2} = \frac{68.5 - 43.5}{2} = \frac{25}{2} = 12.5 $.
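
As a cross-check, a short Python sketch can reproduce these figures using the same $(n+1)/4$ positional rule with linear interpolation (note that library routines such as numpy.percentile use different interpolation conventions by default and may return slightly different quartiles):

```python
def quartile(sorted_values, position):
    """Quartile by the (n+1)/4 positional rule: `position` is 1-based,
    e.g. 2.5 for Q1 or 7.5 for Q3 in this example."""
    lower = int(position)              # lower neighbouring item (1-based)
    frac = position - lower            # fractional part, e.g. 0.5
    value = sorted_values[lower - 1]
    if frac:                           # interpolate towards the next item
        value += frac * (sorted_values[lower] - sorted_values[lower - 1])
    return value

marks = sorted([45, 50, 35, 62, 75, 48, 55, 80, 42])
n = len(marks)
q1 = quartile(marks, (n + 1) / 4)          # 43.5
q3 = quartile(marks, 3 * (n + 1) / 4)      # 68.5
print(q1, q3, (q3 - q1) / 2)               # Q.D. = 12.5
```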



Measures Of Dispersion From Average

These measures of dispersion quantify the variation in a dataset by measuring the average deviation of the individual observations from a central value (like the mean or median). They are considered superior to range and Q.D. because they take every observation into account.


Mean Deviation

The Mean Deviation (M.D.) is defined as the arithmetic mean of the absolute deviations of the observations from a measure of central tendency (either mean or median). We use absolute deviations (ignoring the positive or negative signs) because the sum of deviations from the arithmetic mean is always zero.

Calculation of Mean Deviation from Mean for Ungrouped Data:

$ \text{M.D.}(\bar{x}) = \frac{\sum |x_i - \bar{x}|}{n} $

Calculation of Mean Deviation from Median for Ungrouped Data:

$ \text{M.D.}(\text{Median}) = \frac{\sum |x_i - \text{Median}|}{n} $
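
Both formulas translate directly into code; a minimal sketch using Python's built-in statistics module (the marks from Example 1 are reused here purely for illustration):

```python
from statistics import mean, median

def mean_deviation(values, about):
    """Arithmetic mean of the absolute deviations |x_i - about|."""
    return sum(abs(x - about) for x in values) / len(values)

marks = [45, 50, 35, 62, 75, 48, 55, 80, 42]
print(mean_deviation(marks, mean(marks)))    # M.D. about the mean
print(mean_deviation(marks, median(marks)))  # M.D. about the median
```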


Mean Deviation from Mean for Continuous Distribution

For a continuous frequency distribution, the formula is:

$ \text{M.D.}(\bar{x}) = \frac{\sum f_i |m_i - \bar{x}|}{N} $

where $m_i$ is the mid-point of the class, $f_i$ is the frequency, and $N = \sum f_i$. A similar formula exists for M.D. from the median.
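
A sketch of the grouped-data version, where each class is represented by its mid-point; the class intervals and frequencies below are assumed purely for illustration:

```python
def grouped_mean(midpoints, freqs):
    """Arithmetic mean of a frequency distribution using class mid-points."""
    return sum(f * m for f, m in zip(freqs, midpoints)) / sum(freqs)

def grouped_mean_deviation(midpoints, freqs):
    """M.D. from the mean: sum of f_i * |m_i - x_bar| divided by N."""
    x_bar = grouped_mean(midpoints, freqs)
    N = sum(freqs)
    return sum(f * abs(m - x_bar) for f, m in zip(freqs, midpoints)) / N

# Hypothetical classes 0-10, 10-20, 20-30, 30-40 with assumed frequencies
midpoints = [5, 15, 25, 35]
freqs = [2, 5, 8, 3]
print(grouped_mean_deviation(midpoints, freqs))
```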


Mean Deviation: Comments

Mean Deviation is a better measure than range or Q.D. as it is based on all observations. However, the procedure of ignoring the signs of the deviations is mathematically unsound and makes it difficult to use in further algebraic treatments. This limitation led to the development of the Standard Deviation.


Standard Deviation

The Standard Deviation (S.D.) is the most important and widely used measure of dispersion. It is defined as the positive square root of the arithmetic mean of the squared deviations of the observations from their arithmetic mean. It is denoted by the Greek letter sigma ($ \sigma $).

Squaring the deviations ($x_i - \bar{x}$) overcomes the problem of the sum of deviations being zero, since squared deviations are never negative and cannot cancel one another out. The square of the standard deviation is called the Variance ($ \sigma^2 $).

$ \text{Variance} (\sigma^2) = \frac{\sum (x_i - \bar{x})^2}{n} $

$ \text{Standard Deviation} (\sigma) = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n}} $
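
These definitions carry over directly into code. A minimal sketch of the population form (divisor $n$, as in the formulas above, not the sample form with divisor $n - 1$):

```python
from math import sqrt

def variance(values):
    """Population variance: mean of the squared deviations from the mean."""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / n

def std_dev(values):
    """Standard deviation: positive square root of the variance."""
    return sqrt(variance(values))

marks = [45, 50, 35, 62, 75, 48, 55, 80, 42]  # marks from Example 1
print(variance(marks), std_dev(marks))
```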


Calculation of Standard Deviation for Ungrouped Data

There are several methods of calculation (the actual-mean method, the assumed-mean method, and the step-deviation method), but the most direct involves finding the deviations of each observation from the actual mean and applying the formula above.


Standard Deviation in Continuous Frequency Distribution

For a frequency distribution, the formulas are adapted by incorporating the frequencies ($f_i$) and using mid-points ($m_i$) for class intervals. The step-deviation method is the most efficient for calculation.

Step-Deviation Method Formula:

$ \sigma = \sqrt{ \frac{\sum f_i d'_i{}^2}{N} - \left(\frac{\sum f_i d'_i}{N}\right)^2 } \times c $

where $d'_i = \frac{m_i - A}{c}$, A is the assumed mean, and c is the class width.
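
As a sketch, the step-deviation formula can be wrapped in a small function of the mid-points, frequencies, assumed mean A, and class width c (the function and variable names here are assumptions of mine, not the textbook's); Example 2 below applies the same formula by hand.

```python
from math import sqrt

def step_deviation_sd(midpoints, freqs, A, c):
    """sigma = sqrt(sum(f*d'^2)/N - (sum(f*d')/N)^2) * c,
    where d'_i = (m_i - A) / c."""
    N = sum(freqs)
    d = [(m - A) / c for m in midpoints]
    sum_fd = sum(f * di for f, di in zip(freqs, d))
    sum_fd2 = sum(f * di ** 2 for f, di in zip(freqs, d))
    return sqrt(sum_fd2 / N - (sum_fd / N) ** 2) * c
```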

Example 2. Calculate the Mean, Variance, and Standard Deviation for the following data on the daily wages of 50 workers.

| Wages (₹) | 100-120 | 120-140 | 140-160 | 160-180 | 180-200 |
|---|---|---|---|---|---|
| No. of Workers | 5 | 15 | 20 | 8 | 2 |

Answer:

We use the Step-Deviation method. Let Assumed Mean (A) = 150, Class Width (c) = 20.

| Wages (₹) | $f_i$ | $m_i$ | $d'_i = \frac{m_i - 150}{20}$ | $f_i d'_i$ | $d'_i{}^2$ | $f_i d'_i{}^2$ |
|---|---|---|---|---|---|---|
| 100-120 | 5 | 110 | -2 | -10 | 4 | 20 |
| 120-140 | 15 | 130 | -1 | -15 | 1 | 15 |
| 140-160 | 20 | 150 | 0 | 0 | 0 | 0 |
| 160-180 | 8 | 170 | 1 | 8 | 1 | 8 |
| 180-200 | 2 | 190 | 2 | 4 | 4 | 8 |
| Total | $N = 50$ | | | $\sum f_i d'_i = -13$ | | $\sum f_i d'_i{}^2 = 51$ |

1. Arithmetic Mean ($\bar{x}$):

$ \bar{x} = A + \frac{\sum f_i d'_i}{N} \times c = 150 + \frac{-13}{50} \times 20 = 150 - \frac{260}{50} = 150 - 5.2 = 144.8 $

The mean daily wage is ₹ 144.8.

2. Variance ($\sigma^2$):

$ \sigma^2 = \left[ \frac{\sum f_i d'_i{}^2}{N} - \left(\frac{\sum f_i d'_i}{N}\right)^2 \right] \times c^2 = \left[ \frac{51}{50} - \left(\frac{-13}{50}\right)^2 \right] \times 20^2 $

$ \sigma^2 = \left[ 1.02 - \left(-0.26\right)^2 \right] \times 400 = [1.02 - 0.0676] \times 400 = 0.9524 \times 400 = 380.96 $

The variance is 380.96.

3. Standard Deviation ($\sigma$):

$ \sigma = \sqrt{\text{Variance}} = \sqrt{380.96} \approx 19.52 $

The standard deviation of daily wages is ₹ 19.52.
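
As a cross-check, the same mean, variance, and standard deviation follow from a direct computation with the mid-points and frequencies, without the step-deviation shortcut; a short Python sketch using the figures from the table above:

```python
from math import sqrt

midpoints = [110, 130, 150, 170, 190]
workers   = [5, 15, 20, 8, 2]

N = sum(workers)                                            # 50
mean = sum(f * m for f, m in zip(workers, midpoints)) / N   # 144.8
variance = sum(f * (m - mean) ** 2
               for f, m in zip(workers, midpoints)) / N     # 380.96
print(mean, variance, sqrt(variance))                       # sigma ≈ 19.52
```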


Standard Deviation: Comments

Standard Deviation is considered the best and most powerful measure of dispersion. It is based on all observations, is rigidly defined, and is suitable for further mathematical treatment, forming the basis for many advanced statistical concepts like correlation, regression, and hypothesis testing.



Absolute And Relative Measures Of Dispersion

The measures of dispersion we have discussed so far (Range, Q.D., M.D., S.D.) are all absolute measures. They are expressed in the same units as the original data (e.g., ₹, kg, cm). Absolute measures are useful for describing the variability of a single dataset. However, they are not suitable for comparing the variability of two or more datasets that are expressed in different units or have very different means.

To make such comparisons, we use relative measures of dispersion. These are unit-free ratios or percentages, obtained by dividing the absolute measure of dispersion by an appropriate average.


| Absolute Measure | Relative Measure (Coefficient) | Formula |
|---|---|---|
| Range | Coefficient of Range | $ \frac{L - S}{L + S} $ |
| Quartile Deviation | Coefficient of Quartile Deviation | $ \frac{Q_3 - Q_1}{Q_3 + Q_1} $ |
| Mean Deviation | Coefficient of Mean Deviation | $ \frac{\text{M.D.}(\bar{x})}{\bar{x}} $ or $ \frac{\text{M.D.}(\text{Median})}{\text{Median}} $ |
| Standard Deviation | Coefficient of Variation (C.V.) | $ \frac{\sigma}{\bar{x}} \times 100 $ |

The Coefficient of Variation (C.V.) is the most important and commonly used relative measure of dispersion. It expresses the standard deviation as a percentage of the arithmetic mean. It is used to compare the consistency or stability of different datasets. A lower C.V. indicates greater consistency (less variability), while a higher C.V. indicates lesser consistency (more variability).
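
A minimal sketch of the C.V. in Python, applied to the two batsmen from the introduction; their individual scores below are hypothetical (both sets average 50 runs) and serve only to illustrate the comparison:

```python
from math import sqrt

def coefficient_of_variation(values):
    """C.V. = (standard deviation / mean) * 100, using the population S.D."""
    n = len(values)
    x_bar = sum(values) / n
    sigma = sqrt(sum((x - x_bar) ** 2 for x in values) / n)
    return sigma / x_bar * 100

# Hypothetical scores: both batsmen average 50 runs over five matches
batsman_a = [48, 50, 49, 52, 51]   # clustered around the mean
batsman_b = [10, 100, 5, 95, 40]   # widely scattered

print(coefficient_of_variation(batsman_a))  # small C.V. -> more consistent
print(coefficient_of_variation(batsman_b))  # large C.V. -> less consistent
```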



Lorenz Curve

The Lorenz Curve is a graphical measure of dispersion, used primarily in economics to represent the inequality of a distribution, such as the distribution of income or wealth in a society.


Construction Of The Lorenz Curve

  1. Take the variables under study (e.g., income and number of persons).
  2. Express the data and their corresponding frequencies as cumulative percentages.
  3. Take the cumulative percentage of the number of persons (or households) on the X-axis and the cumulative percentage of income (or wealth) on the Y-axis. Both axes range from 0 to 100.
  4. Draw a diagonal line joining the origin (0,0) to the point (100,100). This is called the Line of Equal Distribution. This line represents a situation of perfect equality, where, for example, 10% of the people have 10% of the income, 50% of the people have 50% of the income, and so on.
  5. Plot the cumulative percentage points from the data on the graph.
  6. Join these points with a smooth curve. This curve is the Lorenz Curve.
Figure: A Lorenz Curve showing the Line of Equal Distribution and a curve representing the actual distribution of income, with the area of inequality highlighted.
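
The construction can also be sketched numerically: compute cumulative percentages of persons and of income, then plot one against the other. A minimal Python sketch with hypothetical quintile data:

```python
# Hypothetical data: five equal population groups (quintiles) and the
# assumed share of total income received by each group
persons = [20, 20, 20, 20, 20]
income  = [5, 10, 15, 25, 45]

def cumulative_percent(values):
    """Running totals expressed as percentages of the grand total,
    starting from 0 so the curve begins at the origin."""
    total, running, out = sum(values), 0, [0.0]
    for v in values:
        running += v
        out.append(100 * running / total)
    return out

x = cumulative_percent(persons)   # cumulative % of persons (X-axis)
y = cumulative_percent(income)    # cumulative % of income  (Y-axis)

# The points (x[i], y[i]) trace the Lorenz Curve; the straight line from
# (0, 0) to (100, 100) is the Line of Equal Distribution.
for xi, yi in zip(x, y):
    print(f"{xi:5.1f}% of persons receive {yi:5.1f}% of income")
```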

Studying The Lorenz Curve

The Lorenz Curve always lies on or below the Line of Equal Distribution; it coincides with the line only in the case of perfect equality. The extent to which the curve bows away from the diagonal line indicates the degree of inequality: the farther the Lorenz Curve is from the Line of Equal Distribution, the greater the inequality in the distribution. If two Lorenz curves are drawn on the same graph, the one that is closer to the line of equality represents a more equitable distribution.



Conclusion

While measures of central tendency provide a single point of summary for a dataset, measures of dispersion provide a crucial understanding of its internal structure and variability. They tell us how representative the average is and how consistent the data is.

From the simple Range to the robust Standard Deviation, each absolute measure provides a different lens to view the spread. Relative measures, especially the Coefficient of Variation, empower us to compare the variability of different datasets on a common scale. Finally, the Lorenz Curve offers a powerful visual tool to understand inequality in distributions.

Together, measures of central tendency and measures of dispersion provide a comprehensive numerical description of a dataset, forming the backbone of descriptive statistics and enabling deeper, more nuanced analysis.